In this practice, we will learn about Regularization and how to perform Hyper-Parameter Search.
We will also learn how to use Ensembles to combine predictions from more than one model, and how to use the KNN and LWLR models (in the next practice).
We will learn how to get code from other developers from PyPI or GitHub and use it in our work.
We will also learn how to use AutoViz to quickly generate graphical DataFrame reports.

The Python Package Index (PyPI) is a repository of software for the Python programming language.
PyPI helps you find and install software developed and shared by the Python community.
Package authors use PyPI to distribute their software.

GitHub is a code hosting platform for version control and collaboration.
It lets you and others work together on projects from anywhere.
It is also a common way to share the code you wrote with other developers.
Sometimes the regular packages we are using are not enough, and we want to use things that are not officially implemented yet.
We can write them ourselves, or search for implementations written by other developers.
These implementations are mostly hosted on GitHub.
If we want to use them and their developers did not upload them to PyPI (where they could be installed easily with pip), we need to download them directly from their GitHub repositories.
We will use git clone to clone repositories onto our machine.
First, we upgrade packages whose preinstalled Colab versions are too old.
!pip install --upgrade plotly
!pip install autoviz
We import our regular packages.
# import numpy, matplotlib, etc.
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
# sklearn imports
from sklearn import metrics
from sklearn import pipeline
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import neural_network
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeavePOut
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
We will use the Vinho Verde White Wines dataset.

Source:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts).
Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
Relevant Information:
The dataset is related to white variants of the Portuguese "Vinho Verde" wine.
Only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as a classification or regression task.
The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).
White wine samples: 4898
Number of attributes: 11 + output attribute
Input variables (based on physicochemical tests):
Output variable (based on sensory data):

Let's download the dataset from Github and explore it with Pandas tools.
# download whitewines.csv file from Github
!wget https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/whitewines.csv
# load the whitewines csv file
whitewines_df = pd.read_csv('/content/whitewines.csv')
whitewines_df
# show whitewines_df info
whitewines_df.info()
# show whitewines_df description
whitewines_df.describe()
We can also use autoviz to show a report on the data.
This report is based on graphs rather than text.
# import autoviz and show a report on whitewines_df
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
dft = AV.AutoViz("", depVar='quality', dfte=whitewines_df, verbose=1)
# divide the data to features and target
t = whitewines_df['quality'].copy()
X = whitewines_df.drop(['quality'], axis=1)
print('t')
display(t)
print()
print('X')
display(X)
Let's see how many different scores we have in the target.
t.unique()
We can see that we have only 7 different scores (ordinal target).
We can try to use regression models and classification models on this dataset.

When we have a case of high variance, we can use regularization to reduce it.
With regularization, we control the size of the model weights and thus control the variance.
We control the size of the model weights by adding a term to the loss function.
This term is called a penalty on high weights.
We have learned about three regularization techniques:

The penalty is fixed for every weight.
Each weight moves up or down in fixed-size steps.
The unnecessary weights tend to reach exactly zero (so it effectively performs feature selection).

When the weights are big, this penalty adds big numbers to the loss function, thus reducing the weights by a lot.
When the weights are small, this penalty adds small numbers to the loss function, thus reducing the weights only a little.
The unnecessary weights stay in the model (but with small values).


A combination of L1 (Lasso) and L2 (Ridge).
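As a small illustration, the three penalty terms can be sketched in NumPy. This is a simplified sketch of the terms added to the loss (libraries may use slightly different scaling conventions), and the alpha and l1_ratio values here are hypothetical:

```python
import numpy as np

def penalties(w, alpha=0.001, l1_ratio=0.5):
    """Return simplified L1, L2 and Elastic Net penalty terms for a weight vector."""
    l1 = alpha * np.sum(np.abs(w))              # Lasso: fixed-size pull toward zero
    l2 = alpha * np.sum(w ** 2)                 # Ridge: pull proportional to weight size
    enet = l1_ratio * l1 + (1 - l1_ratio) * l2  # Elastic Net: a mix of both
    return l1, l2, enet

w = np.array([0.0, -2.0, 0.5])
l1, l2, enet = penalties(w, alpha=0.1)
print(l1, l2, enet)  # 0.25 0.425 0.3375
```

Note how the zero weight contributes nothing to either penalty, and the big weight (-2.0) dominates the L2 term because it is squared.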

We can use these regularizations as hyper-parameters in Scikit-learn SGDRegressor and SGDClassifier.
We can also use regular GD with Scikit-learn Lasso, Ridge and ElasticNet.
Let's start with SGDRegressor.
# print lasso, ridge and elasticnet scores as regression
from sklearn.model_selection import cross_val_score
sgd_lasso_reg = SGDRegressor(penalty='l1', random_state=1)
sgd_ridge_reg = SGDRegressor(penalty='l2', random_state=1)
sgd_elastic_reg = SGDRegressor(penalty='elasticnet', random_state=1)
print("R2 score for regression:")
print('sgd_lasso', cross_val_score(make_pipeline(StandardScaler(), sgd_lasso_reg), X, t, cv=15).mean())
print('sgd_ridge', cross_val_score(make_pipeline(StandardScaler(), sgd_ridge_reg), X, t, cv=15).mean())
print('sgd_elastic', cross_val_score(make_pipeline(StandardScaler(), sgd_elastic_reg), X, t, cv=15).mean())
Let's check the accuracy score of the regression models.
We can do it with Scikit-learn make_scorer.
# create accuracy score for ordinal predictions
from sklearn.metrics import make_scorer, accuracy_score
def get_accurate_ordinal_preds_from_numeric_preds(preds, min_ord=None, max_ord=None):
    # round each numeric prediction and clip it into the valid ordinal range
    preds = np.asarray(preds).ravel()
    if min_ord is None:
        min_ord = round(min(preds))
    if max_ord is None:
        max_ord = round(max(preds))
    return np.array([round(p) if min_ord <= p <= max_ord else min_ord if p < min_ord else max_ord for p in preds])

def accuracy_for_ordinal(t, y):
    min_ord = min(t)
    max_ord = max(t)
    y_ord = get_accurate_ordinal_preds_from_numeric_preds(y, min_ord=min_ord, max_ord=max_ord)
    return accuracy_score(t, y_ord)
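The round-and-clip idea can also be expressed compactly with NumPy. This is an equivalent sketch of the transformation (not the scorer itself), on hypothetical numeric predictions and an assumed ordinal range of 3 to 9:

```python
import numpy as np

# hypothetical numeric predictions and the valid ordinal range of the target
preds = np.array([2.4, 5.6, 9.7, 6.1])
min_ord, max_ord = 3, 9

# round to the nearest integer, then clip into the valid range
ordinal_preds = np.clip(np.round(preds), min_ord, max_ord).astype(int)
print(ordinal_preds)  # [3 6 9 6]
```

The out-of-range predictions (2.4 and 9.7) are pulled back to the nearest valid ordinal label.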
print("Accuracy score for regression:")
print('sgd_lasso', cross_val_score(make_pipeline(StandardScaler(), sgd_lasso_reg), X, t, cv=15, scoring=make_scorer(accuracy_for_ordinal)).mean())
print('sgd_ridge', cross_val_score(make_pipeline(StandardScaler(), sgd_ridge_reg), X, t, cv=15, scoring=make_scorer(accuracy_for_ordinal)).mean())
print('sgd_elastic', cross_val_score(make_pipeline(StandardScaler(), sgd_elastic_reg), X, t, cv=15, scoring=make_scorer(accuracy_for_ordinal)).mean())
Let's try the classification approach and use SGDClassifier.
# print lasso, ridge and elasticnet scores as classification
sgd_lasso_cls = SGDClassifier(penalty='l1', random_state=1)
sgd_ridge_cls = SGDClassifier(penalty='l2', random_state=1)
sgd_elastic_cls = SGDClassifier(penalty='elasticnet', random_state=1)
print("Accuracy score for classification:")
print('sgd_lasso', cross_val_score(make_pipeline(StandardScaler(), sgd_lasso_cls), X, t, cv=15).mean())
print('sgd_ridge', cross_val_score(make_pipeline(StandardScaler(), sgd_ridge_cls), X, t, cv=15).mean())
print('sgd_elastic', cross_val_score(make_pipeline(StandardScaler(), sgd_elastic_cls), X, t, cv=15).mean())
We can see that the classifiers predicted worse here than the regressors.
We can try and create ordinal classifiers that will use the fact that there is an order to the labels.
from sklearn.base import BaseEstimator, clone

class OrdinalClassifier(BaseEstimator):
    def __init__(self, clf, class_opt=None):
        self.clf = clf
        self.clfs = {}
        self.class_opt = class_opt

    def fit(self, X, y):
        self.unique_class = np.sort(np.unique(y))
        if self.unique_class.shape[0] > 2:
            for i in range(self.unique_class.shape[0] - 1):
                # for each of the k - 1 ordinal values we fit a binary classification problem
                binary_y = (y > self.unique_class[i]).astype(np.uint8)
                clf = clone(self.clf)
                clf.fit(X, binary_y)
                self.clfs[i] = clf

    def predict_proba(self, X):
        clfs_predict = {k: self.clfs[k].predict_proba(X) for k in self.clfs}
        norm_unique = self.unique_class - min(self.unique_class)  # shift the classes to start at 0 so they match the classifier keys (makes the model work with the wine data)
        predicted = []
        for i, y in enumerate(norm_unique):
            if i == 0:
                # V1 = 1 - Pr(y > V1)
                predicted.append(1 - clfs_predict[y][:, 1])
            elif y in clfs_predict:
                # Vi = Pr(y > Vi-1) - Pr(y > Vi)
                predicted.append(clfs_predict[y - 1][:, 1] - clfs_predict[y][:, 1])
            else:
                # Vk = Pr(y > Vk-1)
                predicted.append(clfs_predict[y - 1][:, 1])
        return np.vstack(predicted).T

    def predict(self, X):
        # add the minimum class value back to align the predictions with the original labels
        return np.argmax(self.predict_proba(X), axis=1) + min(self.unique_class)

    def score(self, X, t):
        y = self.predict(X)
        return np.array(metrics.accuracy_score(np.array(t), y))
from sklearn.model_selection import cross_val_score
ord_classifier_ridge = OrdinalClassifier(SGDClassifier(penalty='l2', random_state=1, loss='log'), np.array([i for i in range(11)]))
ord_classifier_lasso = OrdinalClassifier(SGDClassifier(penalty='l1', random_state=1, loss='log'), np.array([i for i in range(11)]))
ord_classifier_elastic = OrdinalClassifier(SGDClassifier(penalty='elasticnet', random_state=1, loss='log'), np.array([i for i in range(11)]))
print("Accuracy score for:")
print("ord_classifier_ridge:", cross_val_score(make_pipeline(StandardScaler(), ord_classifier_ridge), X, t, cv=15, scoring=make_scorer(accuracy_score)).mean())
print("ord_classifier_lasso:", cross_val_score(make_pipeline(StandardScaler(), ord_classifier_lasso), X, t, cv=15, scoring=make_scorer(accuracy_score)).mean())
print("ord_classifier_elastic:", cross_val_score(make_pipeline(StandardScaler(), ord_classifier_elastic), X, t, cv=15, scoring=make_scorer(accuracy_score)).mean())
So we have seen two methods to create an ordinal model.
The first method builds it from a regression model, and the second uses a classifier (the second method is explained in the article linked in the references below).
We can see that the first method works a little better on this data.
Most of our models have a lot of parameters that can be adjusted.
Each parameter value can make our model better (or worse).
We want to be able to find the best hyperparameters for our models.
We have two approaches:
When we want to check every parameter possible, we will use Grid Search.
We will try all combinations of parameters and find the best one, that gives us the best score.
This can be exhaustive and slow, especially when we want to check many parameters and values.
We can choose to get random combinations of parameters and check the score on them.
This will not be as accurate as Grid Search, but it will take less time.
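The difference between the two strategies can be sketched with plain Python. The parameter grid here is a hypothetical example:

```python
import itertools
import random

grid = {'penalty': ['l1', 'l2', 'elasticnet'], 'alpha': [0.0001, 0.001, 0.01, 0.1]}

# grid search: enumerate every combination (3 * 4 = 12 candidates)
all_combos = list(itertools.product(grid['penalty'], grid['alpha']))
print(len(all_combos))  # 12

# random search: evaluate only a fixed number of randomly drawn combinations
random.seed(1)
sampled = random.sample(all_combos, 5)
print(len(sampled))  # 5
```

With more parameters the grid size grows multiplicatively, while the random-search budget stays fixed, which is why random search is often cheaper.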
Let's use Scikit-learn GridSearchCV.
# train with grid search and get best parameters
from sklearn.model_selection import GridSearchCV
X_normalized = StandardScaler().fit_transform(X)
hyper_parameters = {'penalty': ('l2', 'l1', 'elasticnet'), 'alpha':[0.0001, 0.001, 0.01, 0.1]}
gs_model = GridSearchCV(SGDClassifier(random_state=1), hyper_parameters).fit(X_normalized, t)
print('Accuracy score for classification:')
print('gs_model', gs_model.best_score_)
print('best params', gs_model.best_params_)
We can see that the best parameters on this model (obtained with Grid Search) were penalty=elasticnet and alpha=0.001.
It may change with a different random_state.
Now let's try Scikit-learn RandomizedSearchCV.
We will use Scipy stats.uniform to get uniformly distributed random values of alpha.
We will use Numpy random.seed to make sure that we get the same result each time we run this cell.
# train with random search and get best parameters
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
np.random.seed(1)
distributions = dict(alpha=uniform(loc=0, scale=1), penalty=['l2', 'l1', 'elasticnet'])
rs_model = RandomizedSearchCV(SGDClassifier(), distributions, random_state=1).fit(X_normalized, t)
print('Accuracy score for classification:')
print('rs_model', rs_model.best_score_)
print('best params', rs_model.best_params_)
The best parameters on this model (obtained with Random Search) were penalty=l2 and alpha=0.417022004702574.
The Accuracy score of the Randomized Search is a little less than the score of the Grid Search, but it may change with a different random seed.
We can use a collection of models to make more accurate predictions and lower the variance.
If we use regression, we can take the mean of all the predictions of the models.
If we use classification, we can take the mean of all the models' probabilities, or choose the class that most of the models chose for each sample.
It is like "The wisdom of the crowd".
One model may be wrong, but a lot of different models are less prone to errors.
We are going to use two types of ensembles:
We create a few bags of samples from the original dataset.
We train a model on each of the bags of samples, and we return the combined score.
The bags can be created using NFold (same as KFold CV):
we simply divide the data into N parts and use KFold CV (where K=N).
We save all N (or K) models and use them as an ensemble.
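A minimal sketch of this idea, using scikit-learn's KFold to create the bags, an SGDClassifier per bag, and a majority vote; the dataset here is synthetic:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(1)
X_demo = rng.randn(200, 4)
t_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # simple synthetic target

# train one model per fold's training part and keep all of them as an ensemble
models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=1).split(X_demo):
    clf = SGDClassifier(random_state=1).fit(X_demo[train_idx], t_demo[train_idx])
    models.append(clf)

# combine the models with a majority vote over their predictions
all_preds = np.stack([m.predict(X_demo) for m in models])          # shape (5, 200)
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=2)  # counts per class
ensemble_pred = np.argmax(votes, axis=0)
print('ensemble accuracy:', (ensemble_pred == t_demo).mean())
```

Each model sees a different 80% of the data, so the ensemble averages out some of the individual models' variance.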
The bags can be created using Bootstrap.
We draw samples out of the dataset (with replacement) and train the model on each group of samples.
We save all the models and use them as an ensemble.
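The bootstrap draw itself can be sketched with NumPy: each bag is n indices drawn with replacement, so a bag typically contains roughly 63% of the distinct original samples:

```python
import numpy as np

rng = np.random.RandomState(1)
n = 1000

# one bootstrap bag: n indices drawn from the dataset with replacement
bag = rng.choice(n, size=n, replace=True)
print(len(bag))                 # the bag has n entries
print(len(np.unique(bag)) / n)  # fraction of distinct samples, around 0.63
```

The repeated and missing indices are what make each model in the ensemble see a slightly different dataset.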
We create a model and train it on the data.
We take the samples that the model predicted incorrectly and give them more weight in the next training round.
We do this until we have a few models, each of which is an expert on some type of samples.
We combine all the models' predictions and return a combined score.
There are a few boosting algorithms (AdaBoost, GradientBoost, etc.).
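The reweighting step can be sketched schematically with NumPy. This is just the principle, not AdaBoost's exact update rule, and the weights and mistakes are hypothetical:

```python
import numpy as np

# hypothetical uniform sample weights and a mask of the samples the model got wrong
weights = np.ones(6) / 6
wrong = np.array([False, True, False, False, True, False])

# increase the weight of the misclassified samples, then renormalize
weights[wrong] *= 2.0
weights /= weights.sum()
print(weights)  # misclassified samples now carry twice the weight of the others
```

The next model in the sequence trains against these new weights, so it focuses on the samples the previous model got wrong.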
Let's start with Scikit-learn BaggingClassifier.
When we use it with bootstrap=False we will refer to it as NFold, and when we use it with bootstrap=True we will refer to it as Bootstrap.
Let's start with NFold Bagging.
# get score with nfold bagging
from sklearn.ensemble import BaggingClassifier
bag_fold_model = BaggingClassifier(base_estimator=SGDClassifier(), n_estimators=20, random_state=1, bootstrap=False).fit(X_normalized, t)
print('Accuracy score for classification:')
print('bag_fold_model', bag_fold_model.score(X_normalized, t).mean())
Let's try the Bootstrap Bagging.
# get score with bootstrap bagging
bag_boot_model = BaggingClassifier(base_estimator=SGDClassifier(), n_estimators=20, random_state=1, bootstrap=True).fit(X_normalized, t)
print('Accuracy score for classification:')
print('bag_boot_model', bag_boot_model.score(X_normalized, t).mean())
In our case, NFold Bagging got better results, but the difference is small and may change if we use a bigger n_estimators.
Let's try AdaBoosting.
# get score with ada boosting
from sklearn.ensemble import AdaBoostClassifier
ada_boost_model = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_normalized, t)
print('Accuracy score for classification:')
print('ada_boost_model', ada_boost_model.score(X_normalized, t).mean())
In our case, the bagging ensemble performs best.
We can use a special form of Linear Regression.
This form is called Locally Weighted Linear Regression.
This is the equation (the standard weighted least-squares form):

$J(\theta) = \sum_{i=1}^{m} \beta^{(i)} \left(t^{(i)} - \theta^T x^{(i)}\right)^2$
We can see that we added a weight (beta) for every sample.
This weight can help us emphasize the importance of the samples that are similar to our test sample.
For every test sample, we train the model from scratch and give each training sample a weight that is corresponding to the distance from the test sample.
We can use the Gaussian function as weight:

$\beta^{(i)} = \exp\left(-\frac{\lVert x^{(i)} - x \rVert^2}{2\tau^2}\right)$
If tau is small, only closer samples are taken into account in the WMSE (big distances get a small weight (beta)).
If tau is big, we get our regular MSE (big distances affect it as much as small distances).
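We can sketch the effect of tau with NumPy; the test sample and training points here are hypothetical:

```python
import numpy as np

def gaussian_weights(X_train, x_test, tau):
    # beta_i = exp(-||x_i - x||^2 / (2 * tau^2))
    dists_sq = np.sum((X_train - x_test) ** 2, axis=1)
    return np.exp(-dists_sq / (2 * tau ** 2))

X_train = np.array([[0.0], [1.0], [5.0]])
x_test = np.array([0.0])

print(gaussian_weights(X_train, x_test, tau=0.5))  # far samples get weight ~0
print(gaussian_weights(X_train, x_test, tau=100))  # all weights ~1, like regular MSE
```

With a small tau only the nearby training samples influence the fit; with a huge tau every sample gets (almost) the same weight.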

Scikit-learn does not have LWLR in its arsenal, so we need to use a package from GitHub.
This package is not stored on PyPI, so we cannot download it with pip install.
We need to use git clone.
# clone the lwlr repo from github
!git clone https://github.com/qiaochen/CourseExercises
Let's check the LWLR model with k=1.
# get cv score for lwlr with k=1
from CourseExercises.lwlr import LWLR
arr_X_normalized = np.asarray(X_normalized)
print('R2 score for regression:')
print('lwlr', cross_val_score(LWLR(k=1), arr_X_normalized, t, cv=5, scoring='r2').mean())
print()
print('Accuracy score for regression:')
print('lwlr', cross_val_score(LWLR(k=1), arr_X_normalized, t, cv=5, scoring=make_scorer(accuracy_for_ordinal)).mean())
Let's get the best k for the LWLR model.
It may take some time (we are building the model from scratch for every test sample), so let's check how long it takes.
%%time
# get best k for lwlr (show the calculation of this cell)
hyper_parameters = {'k': list(range(1, 10))}
gs_lw_model = GridSearchCV(LWLR(k=1), hyper_parameters, scoring='r2').fit(arr_X_normalized, t)
print('R2 score for regression:')
print('gs_lw_model', gs_lw_model.best_score_)
print('best params', gs_lw_model.best_params_)
print()
gs_lw_model = GridSearchCV(LWLR(k=1), hyper_parameters, scoring=make_scorer(accuracy_for_ordinal)).fit(arr_X_normalized, t)
print('Accuracy score for regression:')
print('gs_lw_model', gs_lw_model.best_score_)
print('best params', gs_lw_model.best_params_)
We can take the idea we saw in LWLR to the extreme level and create a model that predicts only based on the closest training samples to a test sample.
This model is called K Nearest Neighbors.
We can choose the k and the model will calculate the prediction for each test sample, based on the closest k training samples to the test sample.
We need to determine what close means.
We need some sort of distance function to measure the closeness of each training sample to the test sample.
We can use the Euclidean distance:

$d(u, v) = \sqrt{\sum_{j=1}^{n} (u_j - v_j)^2}$
We can use KNN in classification tasks or regression tasks.
When we use it in a regression task, we can take the mean of the target of all the neighbors.
When we use it in a classification task, we can take the mean of the probability of all the neighbors, or we can use voting, and choose the label that got the most votes from the neighbors.
We can give all the neighbors in the decision group the same weight in the vote, or give the closest neighbors a higher weight than the farthest ones.
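A minimal NumPy sketch of the KNN voting idea (Euclidean distance, unweighted majority vote), on hypothetical toy data with two clusters:

```python
import numpy as np

def knn_predict(X_train, t_train, x_test, k=3):
    # Euclidean distance from the test sample to every training sample
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # labels of the k nearest training samples
    nearest = t_train[np.argsort(dists)[:k]]
    # unweighted majority vote over the neighbors' labels
    return np.bincount(nearest).argmax()

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
t_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, t_train, np.array([0.5, 0.5])))  # 0 (near the first cluster)
print(knn_predict(X_train, t_train, np.array([5.5, 5.5])))  # 1 (near the second cluster)
```

There is no training step at all: like LWLR, all the work happens at prediction time, against the stored training samples.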

Let's use Scikit-learn KNeighborsClassifier.
# run KNN on the dataset and find best K by accuracy
from sklearn.neighbors import KNeighborsClassifier
hyper_parameters = {'n_neighbors': list(range(1, 20))}
gs_neigh_model = GridSearchCV(KNeighborsClassifier(n_neighbors=5), hyper_parameters).fit(arr_X_normalized, t)
print('Accuracy score for classification:')
print('gs_neigh_model', gs_neigh_model.best_score_)
print('best params', gs_neigh_model.best_params_)
We can see that the best n_neighbors is 16.
Let's try the Scikit-learn KNeighborsRegressor.
# run KNN on the dataset and find best K by R2 and accuracy
from sklearn.neighbors import KNeighborsRegressor
hyper_parameters = {'n_neighbors': list(range(1, 20))}
gs_neigh_model = GridSearchCV(KNeighborsRegressor(n_neighbors=5, weights='distance'), hyper_parameters).fit(arr_X_normalized, t)
print('R2 score for regression:')
print('gs_neigh_model', gs_neigh_model.best_score_)
print('best params', gs_neigh_model.best_params_)
print()
gs_neigh_model = GridSearchCV(KNeighborsRegressor(n_neighbors=5, weights='distance'), hyper_parameters, scoring=make_scorer(accuracy_for_ordinal)).fit(arr_X_normalized, t)
print('Accuracy score for regression:')
print('gs_neigh_model', gs_neigh_model.best_score_)
print('best params', gs_neigh_model.best_params_)
The KNeighborsRegressor did better than the KNeighborsClassifier on this dataset.
It might be due to the additional information it has on the order of the classes.
We also used weights='distance' instead of the default weights='uniform', which makes this KNN model act more similarly to the LWLR model we have seen previously.
Guide on how to upload Python packages to PyPI:
How to upload your python package to PyPi
Explanation of a few EDA libraries in Python:
4 Libraries that can perform EDA in one line of python code
Explanation of Vinho Verde wines:
Portuguese Vinho Verde wine: everything you need to know
Kaggle notebook on the white wine dataset:
KNN for classifying wine quality
Kaggle notebook on the white wine dataset:
Predicting White Wine Quality
Explanation on Lowess smoothing:
Lowess Smoothing: Overview
Explanation on regularization:
REGULARIZATION: An important concept in Machine Learning
Article about the geometry of Ridge and Lasso regularizations:
Regularization and Geometry
Explanation on Ridge, Lasso, and Elastic Net regularizations:
An Introduction to Ridge, Lasso, and Elastic Net Regression
Explanation of the differences between Ridge and Lasso regularizations:
Intuitive and Visual Explanation on the differences between L1 and L2 regularization
Guide on how to use regular classifier as an ordinal classifier:
Simple Trick to Train an Ordinal Regression with any Classifier
A list of predefined scores for Scikit-learn:
Common cases: predefined scores
Explanation on LWLR and a code sample:
Linear Regression: How to overcome underfitting with Locally Weighted Linear Regression (LWLR)
Example of LWLR in python:
Locally Weighted Linear Regression in Python
Short Explanation on LWLR:
ML | Locally weighted Linear Regression